Train Random Forest Classifier for Engagement Prediction and Find Most Important Features

A Random Forest classifier will be trained on only those users whose tweets were classified by the text classifier. This restriction lets us include the percent of pro-protester tweets as a feature of the classification.

Resources: CS109 Homework #5


Imports


In [2]:
import numpy as np
import scipy as sp
import pandas as pd
import sklearn
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

Bring in data: text classification for users (pickle file), August tweets (CSV file), and November tweets (CSV file).


In [128]:
#users with classified tweets
user_tc = pd.read_pickle('final_aug_percents.pkl')

In [129]:
#count is the number of tweets with hashtag #Ferguson or #ferguson
#perc_p is the percent of the user's tweets that have been classified as 'pro-protester'
user_tc.head()


Out[129]:
user.screen_name count perc_p
0 ItriSamele 11 0.727273
1 gohogsgirl 28 0.857143
2 averrer 17 0.823529
3 17147578976 15 1.000000
4 fucking_ninoX_X 12 0.833333

In [5]:
#load august_reduced_all for data for Aug 10 - 17
df_aug = pd.read_csv('/home/data/aug_reduced_all.csv')


/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1154: DtypeWarning: Columns (0,2,3,4,6,7,10,11,12,14,15,17) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
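The DtypeWarning above means pandas found more than one type in some columns. The warning itself suggests two remedies, sketched here on a small in-memory CSV rather than the real file. The sample rows and the choice of str for the listed columns are assumptions for illustration only.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; these rows are made up for illustration.
csv_text = u"""id,user.screen_name,user.followers_count
1,alice,10
2,bob,not_a_number
3,carol,30
"""

# Option 1: declare the dtype of the problem columns up front.
df_typed = pd.read_csv(io.StringIO(csv_text),
                       dtype={'id': str, 'user.followers_count': str})

# Option 2: let pandas read the whole file before inferring types.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)
```

Either option only silences the warning. Columns that really do mix text and numbers still need an explicit numeric conversion before they are averaged.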

In [45]:
list(df_aug.columns.values)


Out[45]:
['Unnamed: 0',
 'id',
 'lang',
 '_iso_created_at_x',
 'text',
 'user.id_x',
 'user.screen_name',
 'user.geo_enabled',
 'user.statuses_count',
 'user.friends_count',
 'user.lang',
 'user.name',
 'user.following',
 'user.followers_count',
 'retweeted',
 'in_reply_to_screen_name',
 'retweet_count',
 'entities.user_mentions',
 '_iso_created_at_y',
 'user.id_y',
 'retweeted_status.user.id',
 'retweeted_status.favorite_count',
 'retweeted_status.favourities_count',
 'retweeted_status.user.id.1',
 'retweeted_status.user.followers_count',
 'retweeted_status.user.friends_count',
 'in_reply_to_status_id',
 'retweeted_status.in_reply_to_status_id',
 'retweeted_status.in_reply_to_user_id']

In [7]:
df_nov = pd.read_csv('/home/data/nov_reduced.csv')


/home/ubuntu/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1154: DtypeWarning: Columns (0,1,2,3,5,6,7,8,9,10,11,13,14,16) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

In [8]:
df_nov.head()


Out[8]:
id lang _iso_created_at text user.id user.screen_name user.geo_enabled user.statuses_count user.friends_count user.lang ... entities.user_mentions retweeted_status.user.id retweeted_status.favorite_count retweeted_status.favourities_count retweeted_status.user.id.1 retweeted_status.user.followers_count retweeted_status.user.friends_count in_reply_to_status_id retweeted_status.in_reply_to_status_id retweeted_status.in_reply_to_user_id
0 538844861919928320 en 2014-11-30T00:00:00.000Z @ainsleyearhardt #ferguson doesn't deserve #Da... 24520375 PathFlounder False 3617 160 en ... [ { "id" : 186646579, "indices" : [ 0, 16 ], "... NaN NaN NaN NaN NaN NaN 5.388415e+17 NaN NaN
1 538844861181722624 en 2014-11-30T00:00:00.000Z Darren Wilson has resigned from the Ferguson p... 18956073 dcexaminer True 84607 12042 en ... [] NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 538844860866760705 en 2014-11-30T00:00:00.000Z RT @RT_com: BREAKING: Darren Wilson resigns in... 22842540 mickey228 False 69514 424 ja ... [ { "id" : 64643056, "indices" : [ 3, 10 ], "i... 64643056 120 NaN 64643056 814889 451 NaN NaN NaN
3 538844858832912384 en 2014-11-30T00:00:00.000Z .@BMUatYale ..."Which of #MyBrothersKeeper gon... 79838661 ugotGod False 96307 71 en ... [ { "id" : 1716317970, "indices" : [ 1, 11 ], ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 538844856278196224 en 2014-11-29T23:59:59.000Z RT @TheAnonMessage: BREAKING: Darren Wilson re... 511004909 Manny_Fresh21 True 27037 606 en ... [ { "id" : 423662810, "indices" : [ 3, 18 ], "... 423662810 149 NaN 423662810 98591 298 NaN NaN NaN

5 rows × 26 columns


In [9]:
nov = df_nov[['user.screen_name', '_iso_created_at']]

In [10]:
nov_df = pd.DataFrame({'count' : nov.groupby( [ "user.screen_name"] ).size()}).reset_index()

In [11]:
nov_df.head()


Out[11]:
user.screen_name count
0 000120o 43
1 000Dillon000 15
2 000RowanPark000 25
3 000kek 6
4 007JamesBong_ 11

Prepare features in August data

The following features will be used:

  • average number of friends
  • average number of followers
  • maximum number of total tweets
  • number of replies in the #F/ferguson tweets
  • number of retweets in the #F/ferguson tweets
  • percentage of #F/ferguson tweets that were replies
  • percentage of #F/ferguson tweets that were retweets

In [ ]:
#group by screen name and get average friends
fr_df = df_aug[['user.screen_name', 'user.friends_count']].dropna(how='all')
friends = fr_df.groupby(['user.screen_name'], as_index=False).mean()

In [ ]:
#group by screen name and get average followers
fo_df = df_aug[['user.screen_name', 'user.followers_count']].dropna(how='all')
fo_df['user.followers_count'] = fo_df['user.followers_count'].convert_objects(convert_numeric=True)
followers = fo_df.groupby(['user.screen_name'], as_index=False).mean()

In [ ]:
#group by screen name and get max of total tweets
tt_df = df_aug[['user.screen_name', 'user.statuses_count']].dropna(how='all')
tt_df['user.statuses_count'] = tt_df['user.statuses_count'].convert_objects(convert_numeric=True)
total_tweets = tt_df.groupby(['user.screen_name'], as_index=False).max()
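The convert_objects(convert_numeric=True) calls above were deprecated and later removed from pandas. On newer versions, pd.to_numeric with errors='coerce' is the usual substitute. The sketch below runs on made-up rows, so the DataFrame contents are assumptions and not values from the August data.

```python
import pandas as pd

# Made-up rows: one follower count is a string that cannot be parsed.
fo_df = pd.DataFrame({
    'user.screen_name': ['alice', 'alice', 'bob', 'bob'],
    'user.followers_count': ['10', '20', 'oops', '40'],
})

# errors='coerce' turns unparseable entries into NaN instead of raising.
fo_df['user.followers_count'] = pd.to_numeric(
    fo_df['user.followers_count'], errors='coerce')

# mean() skips NaN, so bob's average uses only the valid value.
followers = fo_df.groupby('user.screen_name', as_index=False).mean()
```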

In [38]:
#restrict august dataframe to users in user_tc dataframe
aug = df_aug[(df_aug['user.screen_name'].isin(user_tc['user.screen_name']))].reset_index()

#http://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe

In [61]:
#get reply count for each user
for i in range(0, len(user_tc)):
    count = aug[aug['user.screen_name'] == user_tc.loc[i, 'user.screen_name']]['in_reply_to_screen_name'].count()
    user_tc.loc[i, 'total_replies'] = count

In [64]:
#get retweet count for each user
for i in range(0, len(user_tc)):
    count = aug[aug['user.screen_name'] == user_tc.loc[i, 'user.screen_name']]['retweeted_status.user.id'].count()
    user_tc.loc[i, 'total_retweets'] = count
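The two loops above filter the full August frame once per user, which gets slow as the user list grows. One possible alternative is a single groupby followed by a merge, sketched below on made-up rows. The toy frames are assumptions; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up tweets: NaN means "not a reply" / "not a retweet".
aug = pd.DataFrame({
    'user.screen_name': ['alice', 'alice', 'alice', 'bob', 'bob'],
    'in_reply_to_screen_name': ['bob', np.nan, np.nan, np.nan, np.nan],
    'retweeted_status.user.id': [np.nan, 7.0, 9.0, np.nan, 3.0],
})
user_tc = pd.DataFrame({'user.screen_name': ['alice', 'bob', 'carol']})

# count() tallies non-null entries per user, like the per-row loop does.
counts = (aug.groupby('user.screen_name')
             [['in_reply_to_screen_name', 'retweeted_status.user.id']]
             .count()
             .rename(columns={'in_reply_to_screen_name': 'total_replies',
                              'retweeted_status.user.id': 'total_retweets'})
             .reset_index())

# Left merge keeps every user; users with no tweets in aug get 0.
user_tc = user_tc.merge(counts, on='user.screen_name', how='left')
user_tc[['total_replies', 'total_retweets']] = (
    user_tc[['total_replies', 'total_retweets']].fillna(0).astype(int))
```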

In [69]:
#calculate %retweets and %replies for the tweets with the #F/ferguson hashtag
user_tc['pct_replies'] = user_tc['total_replies'] / user_tc['count']
user_tc['pct_retweets'] = user_tc['total_retweets'] / user_tc['count']

In [70]:
user_tc.head()


Out[70]:
user.screen_name count perc_p total_replies total_retweets pct_replies pct_retweets
0 ItriSamele 11 0.727273 0 0 0.000000 0.000000
1 gohogsgirl 28 0.857143 1 8 0.035714 0.285714
2 averrer 17 0.823529 0 0 0.000000 0.000000
3 17147578976 15 1.000000 0 14 0.000000 0.933333
4 fucking_ninoX_X 12 0.833333 0 7 0.000000 0.583333

In [72]:
#merge all feature dataframes together
features = user_tc.merge(friends, on = 'user.screen_name', how = 'inner').merge(followers, on='user.screen_name', how='inner').merge(total_tweets, on='user.screen_name', how='inner')

#code help from http://stackoverflow.com/questions/23668427/pandas-joining-multiple-dataframes-on-columns

In [73]:
features.head()


Out[73]:
user.screen_name count perc_p total_replies total_retweets pct_replies pct_retweets user.friends_count user.followers_count user.statuses_count
0 gohogsgirl 28 0.857143 1 8 0.035714 0.285714 1072.777778 724.666667 16987.777778
1 17147578976 15 1.000000 0 14 0.000000 0.933333 155.785714 15.000000 144.928571
2 fucking_ninoX_X 12 0.833333 0 7 0.000000 0.583333 311.000000 384.000000 9957.000000
3 76stephc 17 0.882353 0 5 0.000000 0.294118 237.800000 40.200000 1148.400000
4 erinisinire 14 0.857143 0 8 0.000000 0.571429 723.700000 312.500000 5412.800000

Determine if users remained engaged

We define a user as having remained engaged if they tweeted 10 or more times with the hashtag #F/ferguson during Nov. 25 - Dec. 1.


In [74]:
#merge the august features dataframe with the november dataframe
aug_nov = pd.merge(features, nov_df, on='user.screen_name', how='left')

In [109]:
#determine if user remained engaged (1 = yes, 0 = no)
for i in range(0, len(aug_nov)):
    if aug_nov.loc[i, 'count_y'] >= 10:
        aug_nov.loc[i, 'rem_eng'] = 1
    else:
        aug_nov.loc[i, 'rem_eng'] = 0
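The labelling rule can also be written as one vectorized comparison. The sketch below uses made-up counts, so the frame is an assumption. It relies on the fact that comparing NaN with >= yields False, which means users missing from the November data are labelled 0, just as in the loop.

```python
import numpy as np
import pandas as pd

# Made-up November counts; NaN means the user did not appear in November.
aug_nov = pd.DataFrame({
    'user.screen_name': ['alice', 'bob', 'carol', 'dave'],
    'count_y': [2.0, np.nan, 10.0, 43.0],
})

# NaN >= 10 evaluates to False, so absent users are labelled 0.
aug_nov['rem_eng'] = (aug_nov['count_y'] >= 10).astype(int)

# The mean of a 0/1 column is the fraction of users who remained engaged.
share_engaged = aug_nov['rem_eng'].mean()
```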

In [110]:
aug_nov.head()


Out[110]:
user.screen_name count_x perc_p total_replies total_retweets pct_replies pct_retweets user.friends_count user.followers_count user.statuses_count count_y rem_eng
0 gohogsgirl 28 0.857143 1 8 0.035714 0.285714 1072.777778 724.666667 16987.777778 2 0
1 17147578976 15 1.000000 0 14 0.000000 0.933333 155.785714 15.000000 144.928571 NaN 0
2 fucking_ninoX_X 12 0.833333 0 7 0.000000 0.583333 311.000000 384.000000 9957.000000 9 0
3 76stephc 17 0.882353 0 5 0.000000 0.294118 237.800000 40.200000 1148.400000 NaN 0
4 erinisinire 14 0.857143 0 8 0.000000 0.571429 723.700000 312.500000 5412.800000 NaN 0

Calculate % remaining engaged in this sample.


In [111]:
rem_eng = len(aug_nov[aug_nov['rem_eng'] == 1])
not_rem_eng = len(aug_nov[aug_nov['rem_eng'] == 0])
print rem_eng*1.0 / (rem_eng + not_rem_eng)


0.31844316674

Prepare X and Y data sets

Prepare the X and Y data sets for the Random Forest classifier.


In [112]:
#convert rem_eng to numpy array called Y
Y = np.array(aug_nov.rem_eng)

In [113]:
#drop user.screen_name
features_d = features.drop('user.screen_name', axis = 1)
features_d = features_d.fillna(0)
features_d.head()


Out[113]:
count perc_p total_replies total_retweets pct_replies pct_retweets user.friends_count user.followers_count user.statuses_count
0 28 0.857143 1 8 0.035714 0.285714 1072.777778 724.666667 16987.777778
1 15 1.000000 0 14 0.000000 0.933333 155.785714 15.000000 144.928571
2 12 0.833333 0 7 0.000000 0.583333 311.000000 384.000000 9957.000000
3 17 0.882353 0 5 0.000000 0.294118 237.800000 40.200000 1148.400000
4 14 0.857143 0 8 0.000000 0.571429 723.700000 312.500000 5412.800000

In [114]:
#convert features to a numpy array (.values also works on pandas versions where as_matrix() has been removed)
X = features_d.values

Features key:

  • count: number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson
  • perc_p: the percent of the user's tweets that were classified by the text classifier as 'pro-protester'
  • total_replies: number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson that were replies
  • total_retweets: number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson that were retweets
  • pct_replies: percent of tweets in count that were replies
  • pct_retweets: percent of tweets in count that were retweets
  • user.friends_count: average number of user's friends between Aug. 10 and Aug. 17
  • user.followers_count: average number of user's followers between Aug. 10 and Aug. 17
  • user.statuses_count: total number of user's lifetime tweets as of their last tweet in the Aug. data set

Train Random Forest Classifier and Generate Score Plots

A Random Forest Classifier will be trained, and score plots will be generated to help choose the best number of trees.


In [117]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

#calculate cross val scores for each random forest
scores = []
for i in range(1, 31):
    rfc = RandomForestClassifier(n_estimators=i)
    score = cross_val_score(rfc, X, Y, cv=10)
    scores.append(score)

#calculate mean score for each random forest
score_means = np.mean(scores, axis=1)
trees = np.arange(30)+1

plt.figure(figsize=(15,8))
plt.scatter(trees, score_means, c='k', zorder=2)
sns.boxplot(scores)
plt.xlabel("Number of Trees")
plt.ylabel("Cross Validation Score")
plt.title("Cross Validation Score vs. Number of Trees")
plt.show()
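The sklearn.cross_validation module imported above was removed in later scikit-learn releases, where cross_val_score is provided by sklearn.model_selection instead. The sketch below runs the same loop on synthetic data. The make_classification stand-in, the fixed random_state, and the three tree counts are all assumptions for illustration, not the notebook's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for X and Y (roughly 70% / 30%).
X_demo, Y_demo = make_classification(n_samples=200, n_features=9,
                                     weights=[0.7, 0.3], random_state=0)

tree_counts = [1, 5, 10]
scores = []
for n in tree_counts:
    # Fixing random_state makes repeated runs comparable.
    rfc = RandomForestClassifier(n_estimators=n, random_state=0)
    scores.append(cross_val_score(rfc, X_demo, Y_demo, cv=5))

# One row per forest size, one column per fold.
scores = np.array(scores)
score_means = scores.mean(axis=1)
```

Without a fixed random_state, the score for a given number of trees changes from run to run. Small differences between neighbouring tree counts may therefore be noise.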


16 trees looks good. Now we'll try with F1 scores.


In [91]:
#calculate F1 scores for each random forest
scores_f1 = []
for i in range(1, 26):
    rfc = RandomForestClassifier(n_estimators=i)
    score = cross_val_score(rfc, X, Y, cv=10, scoring='f1')
    scores_f1.append(score)

#calculate mean F1 score for each random forest
score_means_f1 = np.mean(scores_f1, axis=1)
trees = np.arange(25)+1

plt.figure(figsize=(15,8))
plt.scatter(trees, score_means_f1, c='k',zorder=2)
sns.boxplot(scores_f1)
plt.xlabel("Number of Trees")
plt.ylabel("F1 Score")
plt.title("Cross Validation Score using F1 Scoring vs. Number of Trees")
plt.show()


16 trees still looks good, but we know this data set is unbalanced, so we'll try with a custom cut-off.


In [92]:
def cutoff_predict(clf, X, cutoff):
    #generate prediction probabilities
    prob = clf.predict_proba(X)
    
    #convert probabilities to predictions
    clf_p = np.empty(len(prob))
    for i in range(len(prob)):
        if prob[i][1] > cutoff:
            clf_p[i] = 1
        else:
            clf_p[i] = 0
    return clf_p

#code from Homework #5
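The element-by-element loop in cutoff_predict can be replaced by a single array comparison. The sketch below is an equivalent form tested against a stub classifier. FixedProbaClassifier and cutoff_predict_vec are hypothetical names introduced here, not part of the homework code or of scikit-learn.

```python
import numpy as np

class FixedProbaClassifier(object):
    """Hypothetical stub that returns preset class probabilities."""
    def __init__(self, proba):
        self.proba = np.asarray(proba)

    def predict_proba(self, X):
        return self.proba

def cutoff_predict_vec(clf, X, cutoff):
    # Column 1 holds the probability of the positive class.
    return (clf.predict_proba(X)[:, 1] > cutoff).astype(float)

clf_demo = FixedProbaClassifier([[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]])
X_demo = np.zeros((3, 1))  # ignored by the stub

low = cutoff_predict_vec(clf_demo, X_demo, 0.2)   # more rows labelled 1
high = cutoff_predict_vec(clf_demo, X_demo, 0.5)  # fewer rows labelled 1
```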

In [93]:
def custom_f1(cutoff):
    def f1_cutoff(clf, X, y):
        ypred = cutoff_predict(clf, X, cutoff)
        return sklearn.metrics.f1_score(y, ypred)
        
    return f1_cutoff

#code from Homework #5

#set range of cutoffs
cutoff_range = np.arange(0.1, 0.9, 0.1)

#set up Random Forest Classifier
rfc_c = RandomForestClassifier(n_estimators=15)

#calculate custom F1 scores for each cutoff value
scores_cc = []
for i in cutoff_range:
    score_cc = cross_val_score(rfc_c, X, Y, cv=10, scoring = custom_f1(i))
    scores_cc.append(score_cc)
    
plt.figure(figsize=(15,8))
sns.boxplot(scores_cc, names=cutoff_range)
plt.xlabel("Cutoff Value")
plt.ylabel("F1 Score (using custom cutoff value)")
plt.title("Cross Validation Score with F1 Scoring vs. Cutoff Value")
plt.show()
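To see why a lower cutoff can raise F1 on an unbalanced sample, it helps to compute precision and recall by hand. The sketch below uses made-up labels and probabilities, so the arrays are assumptions. f1_at_cutoff is a hypothetical helper written for this illustration.

```python
import numpy as np

def f1_at_cutoff(y_true, prob_pos, cutoff):
    """F1 for the positive class when predicting 1 if prob_pos > cutoff."""
    y_pred = (prob_pos > cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and probabilities: 3 positives out of 10.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
prob_pos = np.array([0.6, 0.4, 0.3, 0.45, 0.2, 0.1, 0.1, 0.1, 0.0, 0.0])

f1_default = f1_at_cutoff(y_true, prob_pos, 0.5)   # finds 1 of 3 positives
f1_lowered = f1_at_cutoff(y_true, prob_pos, 0.25)  # finds 3 of 3, 1 false alarm
```

In this toy case, lowering the cutoff gains two true positives at the cost of one false positive, so recall rises faster than precision falls.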



In [100]:
#calculate custom-cutoff F1 scores (cutoff = 0.2) for each random forest
scores_f1 = []
for i in range(1, 26):
    rfc = RandomForestClassifier(n_estimators=i)
    score = cross_val_score(rfc, X, Y, cv=10, scoring=custom_f1(0.2))
    scores_f1.append(score)

#calculate mean F1 score for each random forest
score_means_f1 = np.mean(scores_f1, axis=1)
trees = np.arange(25)+1

plt.figure(figsize=(15,8))
plt.scatter(trees, score_means_f1, c='k',zorder=2)
sns.boxplot(scores_f1)
plt.xlabel("Number of Trees")
plt.ylabel("F1 Score")
plt.title("Cross Validation Score using Custom F1 Scoring (Cutoff = 0.2) vs. Number of Trees")
plt.show()


With the custom cutoff F1 score, 6 trees seems optimal. The feature importances will be generated using 6 trees.

Determine importance of features


In [127]:
#set up and fit a random forest classifier
rfc_id = RandomForestClassifier(n_estimators=6)
rfc_id.fit(X, Y)
#calculate feature importances
importances = rfc_id.feature_importances_

index = np.arange(len(importances))
plt.bar(index, importances)
plt.xticks(index+0.5,features_d.columns.values, rotation=75)
plt.show()
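A bar chart makes the ranking easy to eyeball, but a sorted list is easier to quote. The sketch below pairs each name with its importance and sorts them in descending order. rank_features is a hypothetical helper, and the names and numbers are placeholders for illustration only, not the importances from this model.

```python
import numpy as np

def rank_features(names, importances):
    """Return (name, importance) pairs sorted from most to least important."""
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]

# Placeholder values for illustration only; they are not the notebook's results.
names = ['feature_a', 'feature_b', 'feature_c']
importances = np.array([0.5, 0.2, 0.3])

ranked = rank_features(names, importances)
```

In the notebook this would be called with features_d.columns.values and rfc_id.feature_importances_. With only 6 trees and no fixed random_state, the ordering of closely ranked features may change between runs.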


The most important features are:

  • the tweet count, which is the number of tweets the user made between Aug. 10 and Aug. 17 with the hashtag #F/ferguson
  • the total number of tweets in the user's lifetime, as of the last tweet in our August dataset
  • the user's followers count

Show Pair Plots

The pair plots for the features were generated, using the rem_eng (remained engaged) value as the hue. These plots require further analysis to identify interesting trends.


In [115]:
#prep dataframe for pair plots
all_aug_nov = aug_nov.drop('user.screen_name', axis = 1)
all_aug_nov = all_aug_nov.drop('count_y', axis = 1)

In [87]:
all_aug_nov.head()


Out[87]:
count_x perc_p total_replies total_retweets pct_replies pct_retweets user.friends_count user.followers_count user.statuses_count rem_eng
0 28 0.857143 1 8 0.035714 0.285714 1072.777778 724.666667 16987.777778 1
1 15 1.000000 0 14 0.000000 0.933333 155.785714 15.000000 144.928571 0
2 12 0.833333 0 7 0.000000 0.583333 311.000000 384.000000 9957.000000 1
3 17 0.882353 0 5 0.000000 0.294118 237.800000 40.200000 1148.400000 0
4 14 0.857143 0 8 0.000000 0.571429 723.700000 312.500000 5412.800000 0

In [88]:
sns.pairplot(all_aug_nov, hue='rem_eng')


Out[88]:
<seaborn.axisgrid.PairGrid at 0x7fce93980410>